DQe-c is a modular tool developed in R statistical language for assessing completeness in EHR data repositories. The tool also performs a data model conformance test that pertain to data completeness. Each run of DQe-c produces a web-based report (.html document) that include visualizations of the data completeness test at a given time (or data load) and changes in key frequencies over time. Data preparation, analyses, and visualizations are performed under seven modules, where each module consists of one or more R scripts (Figure 1). Modules one to six perform data preparation and analyses and store their outputs as comma-separated flat files in the reports directory. The modular design increases flexibility of the tool for future improvements and facilitates interoperability.
dqec_workflow
The set up and execution module consists of three scripts. All necessary packages are loaded in lib.R, called by RUN.R script. DQe-c’s execution is governed by the RUN.R script, which initiates three modules one after each other (order of initiations are identified in Figure 1). Scripts within each module initiate their dependent scripts, respectively. The latest version of DQe-c (3.1) works on two common data models (CDM), PCORnet version 3 and OMOP version 5. The tool also operates on two Relational Database Management System (RDBMS), MS SQL Server and PostgreSQL, calling SQL queries from within R commands via JDBC/ODBC connection. This capability increases scalability of the tool against large EHR repositories.
To run DQe-c, the user needs to: (1) specify execution parameters in the RUN.R script, including CDM, SQL, organization name, and database specifications, and (2) set up SQL connection credentials in the keys.R script.
The third (and final) step to run DQe-c is to specify SQL connection information (i.e., data base driver, data base name, host address, and connection port) in from connectivity module. This module establishes the JDBC/ODBC connection with the respective RDBMS.
The clinical indicators module performs tests to count frequencies of patients without information on key clinical indicators. The development and addition of this module was inspired in a collaboration with DARTNet Institute,25 while implementing an earlier version of DQe-c into their data repository. Selection of clinical indicators in this module is flexible and can be customized based on local needs. This module initiates the data model module as a dependency.
The data model module performs data model-related completeness test. DQe-c version 3.1 checks for the existence of orphan foreign keys in the database, based on the relational constrains. This test looks at completeness from a conformance point of view. More data model-related checks will be added to the future versions of DQe-c through this module. The module initiates its dependent data preparation module.
Data preparation module operate the first data operations after the connection to the SQL is established. After the execution parameters are set in the RUN.R, script prep.R reads the respective data model template (OMOP v5 vs. PCORnet V3) and prepares the system to call respective SQL queries (MS SQL Server vs. PostgreSQL). Two comma-separated flat file provide the CDM templates for DQe-c to operate. The first step to expanding the tools functionality to other CDMs is to create a new CDM template and modify the data preparation module. This script creates a reference table for processing in freq.R, which counts and stores frequencies of rows and unique values in each rows. Results of these counts are added to the reference table and will be used by the missingness module for further processing.
The missingness module calculates percentage of missing values for each column of each tables available in the database. The results complete the reference table created in the data preparation model, and used by the visualization and presentation module to generate the DQe-c report.
The visualization and presentation module includes an R Markdown document that generates the HTML report from completeness tests conducted through DQe-c. This module uses the outputs of its preceding modules, as they are stored with specific names as comma-separated flat files in the reports directory. We will provide a brief description of DQe-c report in the next section.
Each run of DQe-c generates an HTML report that summarizes outputs from its data preparation and analytics in tables and graph visualizations. The report is organized in four sections.
The first section of the report presents a databased-level snapshot summary of the latest data loaded in the clinical repository. The summary includes a table that present a list of CDM tables, their availability status (in three categories: (1) available, (2) loaded but empty, (3) not loaded), Gigabyte size and number of rows for each table. These information are then presented in three visualizations (Figures 2, 3, and 4). Data for this section of the report was generated by the data preparation module.
This DQec report is generated from testing completeness in PCORnet_v3 data from DQe-cDemoPCORnet on 2017-01-17.
The table bolow provides a list of CDM tables provided (and not provided) in the data load.
The source data this table and the following graphics in this section are being generated from is load_details_PCORnet_v3_DQe-cDemoPCORnet_14-01-2017.csv
This figure shows which of the CDM tables was received (and not received).
The figure below shows a network visualization of the CDM data model, as well as highlighting the tables that are available in this load (legend is the same as in Figure 2).
Second section of the report illustrates the results of data preparation and missingness modules. This section profiles changes over time in primary keys across loads (Figure 5). This visualization can help to obtain a better understanding of changes in a given clinical data repository over loads.
The table below provides results of completeness test at the value/cell level.
TabNam = PCORnet_v3 table nameColNam = Column nameDQLVL = Level of importance for completeness test. (X: Extremely Important, H: Highly Important, L:Low Importance)FRQ = Frequency of rowsUNIQFRQ = Frequency of unique values in each columnMS1_FRQ = Frequency of cells with NULL/NA values or empty strings in each columnMS2_FRQ = Frequency of cells with characters in each column that don’t represent meaningful data – including, ‘+’, ‘-’, ’_‘,’#‘,’$‘,’*‘,’', ‘?’, ‘.’, ‘&’, ‘^’, ‘%’, ‘!’, '@', and ‘NI’.MSs_PERC = Percentage of overall missing data in each columnData for this table is generated from DQ_Master_Table_PCORnet_v3_DQe-cDemoPCORnet_14-01-2017.csv saved under report directory.
Figure below profiles changes in primary keys across loads as a measure of change in patient/record number over time.
Data for the figure is stored in reports/FRQ_comp_PCORnet_v3_DQe-cDemoPCORnet_14-01-2017.csv
Figures below show proportion of missing cells/values in each column of each table loaded. Figures are generated based on Table 2.
MS1_FRQ = Frequency of cells with NULL/NA values and empty strings in each columnMS2_FRQ = Frequency of cells with characters in each column that don’t represent meaningful data## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
##
## [[4]]
Figures below visualize number of unique key variables that are common in multiple PCORnet_v3 tables.
The Reference column on the right comes from the table in which the variable is a primary key, and therefore is a reference for all other tables.
Count_Out shows number of unique key variables that are not present in the reference table – e.g., person id from observation table that does not exist in person table.
Count_In represent number of unique key variables that are present in the reference table – e.g., person id from observation table that exist in person table as well.
Figure 5 shows the parcentage of patients missing specific key clinical indicators.
This is report is from DQe-c version 3.1
For questions and/or inquiries email: hestiri@mgh.harvard.edu